Combining viral genomics and clinical data to assess risk factors for severe COVID-19 (mortality, ICU admission, or intubation) amongst hospital patients in a large acute UK NHS hospital Trust

Throughout the COVID-19 pandemic, valuable datasets have been collected on the effects of the virus SARS-CoV-2. In this study, we combined whole genome sequencing data with clinical data (including clinical outcomes, demographics, comorbidity, treatment information) for 929 patient cases seen at a large UK hospital Trust between March 2020 and May 2021. We identified associations between acute physiological status and three measures of disease severity; admission to the intensive care unit (ICU), requirement for intubation, and mortality. Whilst the maximum National Early Warning Score (NEWS2) was moderately associated with severe COVID-19 (A = 0.48), the admission NEWS2 was only weakly associated (A = 0.17), suggesting it is ineffective as an early predictor of severity. Patient outcome was weakly associated with myriad factors linked to acute physiological status and human genetics, including age, sex and pre-existing conditions. Overall, we found no significant links between viral genomics and severe outcomes, but saw evidence that variant subtype may impact relative risk for certain sub-populations. Specific mutations of SARS-CoV-2 appear to have little impact on overall severity risk in these data, suggesting that emerging SARS-CoV-2 variants do not result in more severe patient outcomes. However, our results show that determining a causal relationship between mutations and severe COVID-19 in the viral genome is challenging. Whilst improved understanding of the evolution of SARS-CoV-2 has been achieved through genomics, few studies on how these evolutionary changes impact on clinical outcomes have been seen due to complexities associated with data linkage. By combining viral genomics with patient records in a large acute UK hospital, this study represents a significant resource for understanding risk factors associated with COVID-19 severity. However, further understanding will likely arise from studies of the role of host genetics on disease progression.

Introduction
Filtered sequencing data for COG-UK samples are routinely deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession PRJEB37886 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB37886). In addition, high-quality consensus genome files with coverage greater than 90% are routinely deposited to the Global Initiative for Sharing of All Influenza Data (GISAID) database (https://gisaid.org/). Aggregated clinical data are provided within the manuscript and its Supporting Information files. Raw clinical data cannot be shared publicly because of risks to patient confidentiality. Data are available from the Portsmouth Hospitals University NHS Trust Institutional Data Access / Ethics Committee (contact via research.office@porthosp.nhs.uk) and may be made available for researchers who meet the criteria for access to confidential data.
and outcomes) for all hospital admissions collected by the Portsmouth Academic Consortium For Investigating COVID-19 (PACIFIC-19) team at Portsmouth Hospitals University NHS Trust (PHU). PHU saw a steep rise in COVID-19 cases over the winter of 2020, with 3,272 new hospital cases between September and February, and a peak of 539 positive inpatients, representing a national outlier for infections compared to the average peak of 219 in the South East (https://coronavirus.data.gov.uk/).
The aim of this study was to combine the COG-UK dataset and PACIFIC-19 Clinical Outcomes Research Group (CORG) database to develop a data resource linking clinical disease severity, therapeutic interventions, comorbidities and demographics to SARS-CoV-2 genomic lineage data. One clinical severity metric, the National Early Warning Score 2 (NEWS2), provides a simple means of identifying acutely ill patients and those requiring transfer to ICU [32,33]. It is calculated based on 6 physiological parameters recorded at the bedside (respiration rate, oxygen saturation, systolic blood pressure, pulse rate, level of consciousness or new-onset confusion, temperature), each assigned a score of 0-3 by the healthcare team, with a score greater than 7 suggesting a high-risk patient requiring emergency assessment by the critical care team. These data cover COVID-19 infections in the area between March 2020 and May 2021, including the major UK wave of COVID-19 over winter 2020, and were used to explore factors influencing clinical severity of COVID-19 and identify specific mutations or constellations of mutations associated with severe COVID-19. In particular, this time period covers the introduction of the first variant of concern (VOC) Alpha, known also by the Pangolin (https://cov-lineages.org/) lineage name B.1.1.7, allowing us to address whether the emergence of this lineage impacted on the clinical severity of COVID-19. As global restrictions continue to flex in response to ongoing changes in case-loads, and we learn to live with the SARS-CoV-2 virus as new VOCs develop, it is increasingly important to look back at what we have learned to fully understand the factors associated with poor outcomes from COVID-19. There is significant motivation to further expand our knowledge of potential risk factors for severe COVID-19, especially where such factors may allow medical staff to predict a severe outcome of COVID-19 for early intervention.
This study thus provides a significant resource for understanding the role that a variety of clinical factors and viral genomics play in determining patient outcomes.

Study sites
PHU is one of England's largest acute hospital trusts, serving the major coastal port city of Portsmouth and surrounding areas on the South Coast of the UK. The primary site for this study was Queen Alexandra Hospital (QAH), a research hospital within PHU with an 800-bed capacity treating >500,000 patients per year.

Laboratory diagnosis
Quantitative polymerase chain reaction (qPCR) COVID-19 tests for hospital staff, patients, and members of the local community within Portsmouth and surrounding areas were carried out at QAH. Samples were collected from participants using nasopharyngeal swabs and stored and transported in Sigma-Virocult 1 mL Viral Transport Media (VTM) (Medical Wire & Equipment, Corsham, UK).
Multiple clinically validated testing methods were used over the period of the study, following manufacturer's directions. These approaches include using the Panther system with the Aptima SARS-CoV-2 assay (Hologic, Marlborough, USA). This method involves automated RNA extraction and transcription-mediated amplification, providing a qualitative result to confirm the presence or absence of SARS-CoV-2 by amplifying two conserved regions of the SARS-CoV-2 ORF1ab gene, comparing the fluorescence signal to an internal control.
Additional testing was performed using the Anatolia Geneworks SARS-CoV-2 PCR v2 kit, which has 2 SARS-CoV-2 targets: ORF1ab and E gene alongside an internal control. VTM sample extraction was performed on the QIAsymphony SP/AS extraction system (Qiagen, Hilden, Germany) off-board lysis protocol (PATHOGEN, COMPLEX 200_OBL_V4_DSP) using the QIAsymphony DSP Virus/Pathogen Midi or Mini Kit and reverse transcription (RT) real-time qPCR amplification was performed on the LightCycler 480 II (Roche, Basel, Switzerland).
Additional rapid testing was conducted using the Xpert Xpress SARS-CoV-2 assay on the GeneXpert (Cepheid, California, USA), a cartridge-based system for rapid detection, extraction and amplification using real-time RT-qPCR to detect 2 targets for SARS-CoV-2 in the N2 and E gene regions, alongside internal controls.

Sampling
All samples, including patients, healthcare workers (HCWs) and community cases tested for COVID-19 at PHU, were made available for viral extraction and whole genome sequencing. Samples from PHU were sequenced alongside samples from a wide range of NHS Trusts across the South Coast of the UK by the University of Portsmouth as part of the COG-UK consortium [34]. Where samples could not be sequenced due to limits in capacity, the COG-UK surveillance sampling strategy was applied to ensure that sequenced cases formed a random, representative sample of currently circulating variants. Briefly, samples were selected either due to targeted sequencing priorities, such as HCWs for the SARS-CoV-2 Immunity & Reinfection EvaluatioN (SIREN) study (https://snapsurvey.phe.org.uk/siren/), or were selected randomly from available samples each day up to local capacity.

Whole genome sequencing
Sequencing was conducted following the ARTIC nCoV-2019 sequencing protocol V.3 (LoCost) [35]. RNA was reverse transcribed and then amplified with amplicon PCR using the ARTIC nCoV-2019 V3 primer panel (Integrated DNA Technologies, Iowa, USA). This primer panel tiles the SARS-CoV-2 genome with 98 pairs of primers, each producing an amplicon of 500 bp. Odd-numbered primers were pooled separately from even-numbered primers to prevent over-amplification of overlapping amplicon regions.
Nuclease-free water (NFW) was used as a negative control on each sequencing run to assess contamination in the amplification stage. A synthetic SARS-CoV-2 RNA control (Twist Bioscience, San Francisco, CA, USA) was also added to each run as a positive control. To confirm sample quality and assess likely failures or contamination issues, positive and negative controls, along with representative samples from each run, were quantified using the Qubit DNA Assay Kit in a Qubit 2.0 Fluorometer (Life Technologies, California, USA).
The LSK-109 Ligation Sequencing Kit and EXP-NBD196 Native Barcoding Expansion 96 Kit from Oxford Nanopore Technologies (ONT, Oxford, UK) were used to generate libraries for Nanopore sequencing. Libraries were sequenced on R9.4.1 flow cells on a GridION X5 platform (ONT, Oxford, UK) for 24-36 hours (depending on library sample number) to achieve a final coverage of ~100,000 reads per sample. Raw reads were demultiplexed by the MinKNOW software on the GridION using Guppy v3.2.10.

Sample exclusion
If genome sequencing failed (e.g., as a result of the negative control showing evidence of PCR contamination), samples were repeated from scratch. If sufficient RNA was not available, samples were excluded from the study. Samples from PHU were also excluded if the participant involved indicated their retrospective desire to opt out from the study.
For the outcome analysis, further exclusions were also applied to the combined dataset. Samples where the sequence data covered less than 50% of the genome were excluded due to poor resolution of viral variant subclasses. Samples were also excluded for individuals aged less than 16 years old, individuals that were not admitted to the main hospital (e.g., residents of long-term care facilities), and individuals who had not yet completed their hospital stay. In cases where multiple samples were taken from a single individual, the sample with the highest genome coverage was taken forward for further analysis. This is summarised in S1 Fig.
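The exclusion rules for the outcome analysis can be sketched as a simple filter. This is a minimal illustration only: the `Sample` record, its field names and the example values are hypothetical and are not taken from the study's pipeline.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # All field names are hypothetical, chosen for illustration.
    patient_id: str
    coverage: float       # fraction of the genome covered (0.0 to 1.0)
    age: int              # age in years at sampling
    admitted: bool        # admitted to the main hospital
    stay_complete: bool   # hospital stay has finished


def apply_exclusions(samples, min_coverage=0.5, min_age=16):
    """Apply the outcome-analysis exclusions, keeping one sample per patient."""
    best = {}
    for s in samples:
        # Exclude low-coverage genomes and patients under 16.
        if s.coverage < min_coverage or s.age < min_age:
            continue
        # Exclude non-admitted patients and incomplete hospital stays.
        if not (s.admitted and s.stay_complete):
            continue
        # For repeat samples, keep the one with the highest genome coverage.
        current = best.get(s.patient_id)
        if current is None or s.coverage > current.coverage:
            best[s.patient_id] = s
    return list(best.values())


samples = [
    Sample("P1", 0.92, 70, True, True),
    Sample("P1", 0.97, 70, True, True),   # repeat sample, higher coverage
    Sample("P2", 0.40, 65, True, True),   # coverage below 50%
    Sample("P3", 0.95, 12, True, True),   # under 16 years old
    Sample("P4", 0.99, 81, False, True),  # not admitted to the main hospital
    Sample("P5", 0.88, 55, True, False),  # stay not yet complete
]
kept = apply_exclusions(samples)
```

Only the higher-coverage sample from the hypothetical patient `P1` survives the filter here.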

Clinical outcome data
The PACIFIC-19 team at PHU holds a database of patient-specific information (e.g. demographics, COVID-19 status, illness severity scores, treatments and outcomes) for all hospital admissions, including COVID-19 positive patients, between January 2018 and May 2021. The PACIFIC-19 CORG database contains data collated from the Local Laboratory Information Systems (LIMS) using COGNOS for interrogation to identify all positive samples, and manually from the APEX Pathology LIMS. These data were linked to SARS-CoV-2 genome sequence data using the COG-UK sequencing codes and locally assigned sample source IDs.

Clinical data analysis
To maximise the number of near-complete entries usable for our analyses, we dropped data columns where 15% or more of the entries contained missing data. Imputation of missing values was not used to avoid significantly biassing the results.
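As a minimal sketch of the column filter described above, assuming the data are held as a mapping from column name to a list of values with `None` marking a missing entry (the function name, column names and data layout are illustrative assumptions, not the study's code):

```python
def drop_sparse_columns(columns, max_missing=0.15):
    """Keep only columns whose fraction of missing entries is below max_missing.

    `columns` maps a column name to a list of values; None marks a missing
    entry. Columns with 15% or more missing values are dropped, and no
    imputation is performed.
    """
    kept = {}
    for name, values in columns.items():
        if not values:
            continue  # an empty column carries no usable data
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction < max_missing:
            kept[name] = values
    return kept


# Hypothetical 20-row example.
data = {
    "age": list(range(60, 80)),          # 0% missing: kept
    "news2_max": [None] * 2 + [5] * 18,  # 10% missing: kept
    "bmi": [None] * 3 + [27.0] * 17,     # 15% missing: dropped
}
filtered = drop_sparse_columns(data)
```

The strict `<` comparison reflects the "15% or more" wording, so a column sitting exactly at the threshold is removed.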
Three main measures of severity as a result of COVID-19 infection were used in this analysis: patient death within 30 days of diagnosis, patient admission to ICU, or intubation of the patient. In addition, we took a general measure of case severity based on the occurrence of at least one of these three outcomes.
For pair-wise associations between categorical variables, the association strength was calculated using Cramér's V (with bias correction) [38], based on the χ² statistic, with statistical significance calculated using the p-value from a χ² test [39]. For pair-wise associations between continuous variables, the correlation coefficient ρ and p-value from a Spearman's Rank test were used to determine the association strength and statistical significance respectively. For pair-wise associations between categorical and continuous variables, the association strength was determined using the Correlation Ratio η² [40].
To ensure no bias as a result of non-normally distributed data, the continuous variable was ranked prior to calculation. Statistical significance was determined using the p-value from a Kruskal-Wallis H test. In each case, the association strength score A was assumed to be negligible if |A| < 0.1, weak if 0.1 ≤ |A| < 0.3, moderate if 0.3 ≤ |A| < 0.5, and strong if |A| ≥ 0.5. Associations were determined to be statistically significant when p < 0.05.
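The association measures above can be sketched in plain Python as follows. This is an illustrative reconstruction under stated assumptions: the bias correction follows the commonly used Bergsma form, the correlation ratio is returned as η² computed on average ranks, and all function names are hypothetical. The p-value calculations (χ² and Kruskal-Wallis tests) are omitted.

```python
import math
from collections import defaultdict


def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for an r x k table of counts.

    Assumes every row and column total is non-zero.
    """
    n = sum(sum(row) for row in table)
    r, k = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    phi2_corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_corr / min(k_corr - 1, r_corr - 1))


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def correlation_ratio_ranked(categories, values):
    """Correlation ratio (eta squared) computed on the ranked continuous variable."""
    ranks = average_ranks(values)
    groups = defaultdict(list)
    for c, rank in zip(categories, ranks):
        groups[c].append(rank)
    grand_mean = sum(ranks) / len(ranks)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_total = sum((rank - grand_mean) ** 2 for rank in ranks)
    return ss_between / ss_total if ss_total else 0.0


def strength_label(a):
    """Map an association score A onto the bands used in the text."""
    a = abs(a)
    if a < 0.1:
        return "negligible"
    if a < 0.3:
        return "weak"
    if a < 0.5:
        return "moderate"
    return "strong"


# Hypothetical inputs for illustration only.
v_independent = cramers_v_corrected([[10, 10], [10, 10]])
v_perfect = cramers_v_corrected([[50, 0], [0, 50]])
eta2 = correlation_ratio_ranked(["a", "a", "b", "b"], [1.0, 2.0, 3.0, 4.0])
```

With these bands, the abstract's values of A = 0.48 and A = 0.17 fall into the moderate and weak categories respectively.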

Machine learning for the identification of mutations associated with disease severity
Mutation information from sequencing experiments was numerically encoded as follows: 1 = wild-type, 2 = substitution, 3 = insertion, 4 = deletion. These data were linked to clinical data as input for machine learning models to further explore the role of viral mutations of SARS-CoV-2 on severity of disease in COVID-19. We screened nine machine learning models and one deep-learning neural network method to rank and identify mutations with a possible role in determining patient outcomes. Training of models and calculation of accuracy metrics were determined from 6-fold stratified cross-validation screening using Python V3.8.8 with TensorFlow V2. Data were proportioned into an 80:20 train-test split.
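The numeric encoding and the stratified fold assignment can be illustrated with a short standard-library sketch. The function names and the example positions (other than 23403, which is discussed in the Results) are hypothetical, and treating uncalled positions as wild-type is an assumption. The study itself most likely relied on library routines for cross-validation, so the round-robin fold assignment below is a simplified stand-in without shuffling.

```python
# Numeric codes as described in the text.
MUTATION_CODES = {"wild-type": 1, "substitution": 2, "insertion": 3, "deletion": 4}


def encode_sample(calls, positions):
    """Encode one sample as a feature vector over a fixed list of positions.

    `calls` maps a genome position to a mutation type. Positions without a
    call are treated as wild-type (an assumption made for this sketch).
    """
    return [MUTATION_CODES[calls.get(pos, "wild-type")] for pos in positions]


def stratified_folds(labels, n_folds=6):
    """Assign each sample a fold index so that folds share a similar class balance.

    Samples of each class are dealt to folds in round-robin order.
    """
    folds = []
    seen = {}
    for y in labels:
        count = seen.get(y, 0)
        folds.append(count % n_folds)
        seen[y] = count + 1
    return folds


positions = [23403, 11288, 21765]  # illustrative feature positions
features = encode_sample({23403: "substitution", 21765: "deletion"}, positions)

labels = [0] * 12 + [1] * 6        # hypothetical outcomes: 0 = survived, 1 = died
folds = stratified_folds(labels)
```

Each of the six folds in this toy example receives two survivors and one death, preserving the 2:1 class ratio.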
A binary-outcome variable for severity was defined based on mortality having occurred following escalation to the ICU. To address the imbalance in these data, with 3.2-fold fewer cases of mortality than survival, Synthetic Minority Oversampling (SMOTE) techniques were implemented. Hyperparameters were tuned for optimal performance using a Grid-search method while implementing 6-fold cross-validation. To get an overall view of the metrics incorporating both classes, precision, recall and F1 statistics were calculated for cases in the test set with outcome = 0 or outcome = 1 separately, with the macro-average scores calculated based on the mean of the two.
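The macro-averaging step can be sketched as follows: precision, recall and F1 are computed for each class in turn and then averaged with equal weight. The function names and example labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def macro_average(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class precision, recall and F1 scores."""
    per_class = [per_class_metrics(y_true, y_pred, c) for c in classes]
    return tuple(sum(m[i] for m in per_class) / len(per_class) for i in range(3))


# Hypothetical test-set labels: 0 = survived, 1 = died.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
class0 = per_class_metrics(y_true, y_pred, positive=0)
class1 = per_class_metrics(y_true, y_pred, positive=1)
macro = macro_average(y_true, y_pred)
```

Because each class contributes equally to the mean, poor performance on the minority class pulls the macro-average down even when the majority class is classified well.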
The best accuracy, combined with minimal loss scores, was obtained using the multi-layer perceptron artificial neural network (MLP-ANN), using the sequential API within TensorFlow. The input layer to the MLP-ANN introduces linear weighted input variables to the neurons in the hidden layers. Dropout regularization was employed to offset the overfitting typically encountered in machine-learning models [41]. This approximates training of a large number of neural networks with different architectures in parallel, where a number of layer outputs are randomly ignored or dropped out during training. Model accuracy and loss scores began to plateau by 4,000 epochs, so models were run to 10,000 epochs to maximise the accuracy (S2

Patient demographics
The primary dataset used in this analysis combines viral genomics with clinical metadata for PHU. Following filtering of cases (see Materials and methods) and merging of the data sets, combined data for 929 individual patients were used for downstream analyses (S1 Fig). A breakdown of these data based on some of the key demographics and clinical factors can be seen in Table 1. Of these 929 cases, 360 (38.8%) showed severe outcomes (ICU admission, intubation or death within 30 days of diagnosis), with 569 (61.2%) showing non-severe outcomes. Looking at the severe outcomes in more detail, 295 (31.8%) patients died, 111 (11.9%) patients were admitted to ICU, and 93 (10.0%) required intubation in ICU. Of those patients on ICU, 46 (41.4%) also died, suggesting that the majority of fatalities (249; 84.4%) occurred outside of ICU, with 70 (23.7%) occurring outside of the hospital. However, the majority of these deaths (181; 61.4%) occurred in patients aged 80 or above, with only 5 admitted to ICU. In general, patients suffering severe outcomes were older, with a median age of 76 (IQR [63,85]), with 52.8% of cases between 70 and 90 years old. The split between male and female cases was relatively even, with 426 (45.8%) female compared with 503 (54.1%) male cases. The majority of all cases were of white ethnic background (701; 75.5%), 192 (20.7%) cases were of unstated or unknown ethnic origin and the remaining 36 (3.8%) cases were comprised of nonwhite ethnic minority groups.
At the time of COVID-19 diagnosis, almost half of all patients were inpatients (450 cases, 48.4%), with a large proportion being identified through the Emergency Department (ED; 360 cases, 38.8%). A smaller proportion of cases were identified in Critical Care (CC; 62 cases, 6.7%) and the Acute Medical Units (AMU; 33 cases, 3.6%). The majority of patients suffered from at least one of the comorbidities (801 cases, 86.2%) explored in this dataset: diabetes, hypertension, renal disease, malignancy (cancer), heart disease, asthma, or chronic obstructive pulmonary disease (COPD). Hypertension and heart disease were the most common, with 485 (52.2%) and 490 (52.7%) cases respectively, whilst asthma and cancer were rarer, with 101 (10.9%) and 108 (11.6%) cases respectively.

Associations with disease severity
To understand the factors that most affect disease severity (defined by either admission to ICU, receiving invasive mechanical intubation, or death within 30 days of diagnosis), pairwise statistical association analyses were performed for all experimental variables using either Cramér's V, Spearman's Rank or the Correlation Ratio, depending on the data types (see Materials and methods). S3 Fig shows the pairwise association score between all variables in the data set, and Table 2 shows those with a statistically significant (p ≤ 0.05) and non-negligible (A ≥ 0.1) association with COVID-19 severity. A description of these data points is shown in S1 Table. In general, these data show that clinical variables show mild association with outcomes, and demonstrate a lack of individual strong indicators in our dataset that could potentially predict a severe case of COVID-19.
Pre-existing comorbidities also appeared to play a role in susceptibility for severe COVID-19, with a weak association seen for the number of pre-existing conditions a patient might have (A = 0.20, p = 6.81e-09), as well as a weak association to those who have any pre-existing conditions (A = 0.13, p = 2.05e-05; Fig 2A). These links were seen with the death outcome, but not with ICU admission nor intubation (Table 2). Specifically, those with renal disease (A = 0.19, p = 6.32e-09; Fig 2B) or heart disease (A = 0.12, p = 3.10e-04; Fig 2C) showed weak but statistically significant associations with COVID-19 severity, in particular death. These links therefore result in increased odds of having a severe case of COVID-19 (pre-existing con-  Fig 1C) and sex (A = 0.13, p = 6.05e-05; Fig 2A) of the patient also show statistically significant, albeit weak effects on COVID-19 severity. These data show a median age of 80 (IQR [68,86]) in severe cases compared to 74 (IQR  Fig 2D).
In addition, the length of stay (A = 0.17, p = 1.82e-07; Fig 1D) and time to discharge or death (A = 0.15, p = 6.14e-07) also appear to be statistically significant (albeit weak) factors, with a longer median stay of 16 days (IQR [8,26]) seen in severe cases compared to 11 days (IQR [5,20]) in non-severe cases. In particular, we see a long tail for long stays for severe cases, as a result of patients who contracted severe, but non-fatal, COVID-19 and required a significant amount of recovery time. Interestingly, both metrics were associated with ICU admission and intubation, but not with patient death (Table 2).
To further understand the effect of the Alpha lineage on disease severity, we looked at associations with disease severity for Alpha and non-Alpha cases separately (Fig 3). Fig 3A-3F explore how the Alpha variant may have impacted severity outcomes for those with pre-existing conditions, and specifically renal and heart disease, which were identified as being associated with disease severity (Table 2) Fig 4B), although a moderate decrease in odds was seen. Interestingly, the Alpha variant did have an impact on ICU admission (OR = 2.03, 95% CI [1.25, 3.30], p = 5.50e-03; Fig 4C) and whether intubation was required (OR = 2.32, 95% CI [1.36, 3.95], p = 2.51e-03; Fig 4E) for male patients, with approximately twice the odds when compared to non-Alpha variants in both cases. However, no statistically significant impact on ICU admission (OR = 0.84, 95% CI [0.41, 1.73], p = 0.762; Fig 4D) nor whether intubation was required (OR = 1.02, 95% CI [0.48, 2.17], p = 1.00; Fig 4F) was seen with female patients.
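For readers unfamiliar with odds ratios of the kind reported above, the sketch below computes an odds ratio and an approximate 95% confidence interval from a 2x2 table. The Wald (log-normal) interval is an assumption made for this example, since the method used for the reported intervals and p-values is not stated in this section. The counts are invented for illustration and are not the study's data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    All four counts must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: Alpha vs non-Alpha cases, ICU admission vs none.
or_value, ci_low, ci_high = odds_ratio_ci(a=40, b=160, c=30, d=270)
```

An interval that excludes 1, as for the male ICU and intubation results in the text, indicates a statistically significant difference in odds at the chosen level. An interval spanning 1, as for the female results, does not.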

SARS-CoV-2 mutations associated with severe COVID-19
As the SARS-CoV-2 virus has mutated over time, a number of key mutations have been identified, particularly in the spike protein of the virus. We used our dataset to identify whether any specific mutations or clusters of mutations could be identified that might be associated with an increased risk of a negative outcome, thus acting as predictors of outcome in future cases. Features from the joint outcomes and mutation dataset that showed statistically significant relationships with single nucleotide polymorphisms (SNPs) and deletions of the SARS-CoV-2 genome are shown in Table 3. We found no statistically significant link between the SNPs and any of our chosen indicators of disease severity (death, ICU admission, or intubation). Interestingly, we found a moderately weak link between the mutations and the NEWS2 score, both at admission (A = 0.25, p = 1.01e-15) and the maximum recorded score (A = 0.23, p = 1.27e-09). This may indicate that some mutations impact acute physiological status, but not enough to directly result in a severe case of COVID-19. This hypothesis is supported by the weak association between the mutations and the patient's length of stay (A = 0.25, p = 5.89e-16), indicating a weak link between the types of mutations found in patients who experienced symptoms of COVID-19 for longer periods of time. Also interestingly, we see that the mutations were associated with the number of pre-existing conditions the patient has (A = 0.24, p = 3.15e-06) as well as whether the patient has cancer (A = 0.14, p = 1.57e-13), renal disease (A = 0.13, p = 4.08e-11), or COPD (A = 0.11, p = 7.01e-08), although these associations are quite weak. We also see weak associations between the mutations and the demographics of the patient, in particular their ethnic origin (A = 0.15, p < 1.00e-300), sex (A = 0.12, p = 7.48e-08) and age (A = 0.10, p = 1.94e-283). 
Similar weak associations are also seen with the locations of the patient, particularly the admission speciality (A = 0.16, p < 1.00e-300), the location where the patient was swabbed (A = 0.14, p < 1.00e-300) and the ward the patient was located after admission to the hospital (A = 0.14, p < 1.00e-300), likely arising as a result of nosocomial spread of the virus within wards.

Machine learning (ML) and artificial neural network (ANN) analysis of mutations and comorbidity risk-factors associated with disease severity
To further explore the role of viral mutations of SARS-CoV-2 in severity of disease in COVID-19, we utilised machine learning approaches to identify mutations with a possible role in determining patient outcomes. Nine machine learning algorithms and one deep-learning neural network method were tested and ranked according to their accuracy (Fig 5A), with a binary outcome of death (outcome = 1) or no death (outcome = 0) following escalation of care to the ICU. Of these, the eXtreme Gradient Boosting (XGBoost) and the Multi-Layer Perceptron Artificial Neural Network (MLP-ANN) approaches produced the best results, with the MLP-ANN model resulting in slightly improved accuracy (76.2%; Fig 5B, right) compared to XGBoost (74.6%; Fig 5B, left). A comparison of macro-average metric scores between the XGBoost and MLP-ANN models is shown in Fig 5B, with the breakdown for the different outcomes shown in Fig 5C. In both models, Precision and Recall were high in discrimination of COVID-19 patients with greater survival probability (low-risk patients) following their admission to intensive care units, but low for discrimination of high-risk patients. The MLP-ANN model showed a higher Recall (100% vs 92%), but lower Precision (74% vs 78%) for identification of low-risk patients compared to the XGBoost model. Whilst potentially unsuitable for identification of high-risk patients, this model may offer an approach for exclusion of low-risk patients, allowing for the remaining cohort to be observed as potentially high risk, with resources prioritised for more intensive clinical surveillance, management, and attention.
The ranking of SNPs, deletions, and clinical criteria in the order of importance to the model (from top to bottom), based on the Shapley additive explanation (SHAP) values, is shown for the XGBoost model (Fig 5D) and the MLP-ANN model (Fig 5E). For mutations identified as the top predictor variables, the numeric ID indicates the nucleotide position of the nucleotide change relative to the reference SARS-CoV-2 genome, and the gene domain in which the mutation is located is highlighted (S4 Fig).
The feature importance of the predictor variables is different for the XGBoost model compared to the MLP-ANN model, with individual genomic features in the XGBoost model typically showing higher mean SHAP values than those in the MLP-ANN model. Another striking difference is in the ranking of clinical variables, which represent the top ranked features in the XGBoost model, but appear less significant than a number of mutations (in particular deletions) in the MLP-ANN model. Renal disease, heart disease and diabetes feature for both models, with risk factors such as history of hypertension, COPD, prior history of malignancy, and asthma prominent in the XGBoost model, but not in the MLP-ANN model.
One notable similarity between the two models, SNP 23403, corresponds to an A->G mutation at site 23,403, resulting in an Aspartic Acid to Glycine amino acid substitution (D614G) within the spike (S) protein domain. This SNP, seen in the top 20 for both models, became rapidly dominant globally due to increased viral fitness and higher viral loads [42][43][44][45]. It became fixed in the population after the first wave in the UK, with 0% of cases showing the Glycine residue in January 2020, rapidly increasing to 70% of cases in May 2020 [45]. It is therefore likely that association of this SNP with severity is closely linked with temporal developments of the pandemic, with significant improvements in treatment options and vaccine developments from the second wave onwards.
Whilst the spike protein has been linked with increased viral load and fitness [46], and thus may represent an obvious source for identification of mutations linked with disease severity due to its role in host cell receptor binding, the majority of the identified mutations actually seem to lie in other regions of the viral genome. In particular, for the XGBoost model, the  1.7, and clearly delineate Alpha from non-Alpha cases. Similarly, the majority of the remaining mutations identified by the MLP-ANN model appear to be highly specific to the Alpha lineage, indicating that this model primarily identifies the presence of the Alpha lineage as being associated with disease severity. As with SNP 23403, this is likely linked to temporal development in treatment options for those with Alpha later in the pandemic compared to cases with severe symptoms in earlier waves.

Discussion
As the world returns to a more normal state after being plunged into a global pandemic, many questions remain to be answered about COVID-19. In particular, it is still not well understood exactly which factors are most associated with the likelihood of an individual suffering from the most significant negative outcomes, including long-term post-COVID-19 respiratory issues ("long COVID"), requirement for invasive mechanical intubation, admission to ICU, or even death.
In this study, clinical data were linked to viral genomic data from patients seen across an acute NHS Trust on the south coast of the UK. This data resource was used to explore potential links between severe outcomes and viral subtype, patient demographics, and clinical history, to further understand factors that may influence patient responses to the virus. Overall, this study found no strong factors associated with severe cases of COVID-19, instead showing weak influence from myriad factors including age, sex, and existence of pre-existing conditions.
Of course, certain pre-existing conditions are more likely than others to directly influence COVID-19 illness. For example, given that cataracts are typically seen in older individuals, many of those most clinically vulnerable for severe COVID-19 outcomes may suffer from cataracts, with one-fifth of patients awaiting cataract surgery found to be at high risk of severe disease or death from COVID-19 in a 2022 study [47]. However, whilst a serious malady and a leading cause of preventable blindness, suffering from cataracts is itself unlikely to have a significant bearing on the severity of COVID-19 pneumonia. The context of the comorbidity in relation to the pathophysiology of SARS-CoV-2 infection is important, since the primary target organs are the lungs, and pathophysiological progression may require mechanical ventilation in areas where high-dependency or intensive care is offered.
One of the frequently observed disease progressions in COVID-19 is the persistence of micro-coagulopathy, where tiny clots systemically occlude capillaries, such as in the glomeruli [48]. Thus, a patient with a pre-existing compromised renal function, or those with pre-existing cardiac dysfunction (especially previous coronary ischaemia) might show poor recovery trajectories in hospital. It is thus clear that certain pre-existing conditions, particularly renal and heart disease, may make an individual more likely to suffer from severe complications with COVID-19 and have been previously identified as risk factors [49]. Indeed, as shown in Table 2, renal disease (A = 0.15), heart disease (A = 0.14), and cancer (A = 0.11) were identified as being significantly associated with the likelihood of death. Such patients should therefore continue to be monitored closely, to observe signs of deterioration.
It is worth noting, however, that the absolute increase in risk is low based on our data, and a relatively large proportion of those analysed suffered from heart disease (52.7%), renal disease (37.2%), and cancer (11.6%) (Table 1). Indeed, only 13.8% of patients in our dataset had no pre-existing condition at all, highlighting a significant selection bias in the data. There is also a selection bias with admission age, with 79.3% of patients aged 60 and above and a median age of 76. These biases are likely closely related, since older patients typically experience a higher proportion of comorbidities compared to younger age groups [9]. These selection biases may impact other association scores, potentially resulting in underestimated scores for pairwise associations with admission age and comorbidities.
These data also highlight that acute physiological derangement of the patient is linked to severe COVID-19, indicated by a moderate-strong association between the maximum NEWS2 score and whether the patient died within 30 days of diagnosis (A = 0.43), was admitted to ICU (A = 0.21), or required invasive mechanical intubation (A = 0.21) (Table 2). The NEWS2 score summarises a constellation of clinical features that change dynamically (particularly within an acute setting), but is a simple-to-calculate metric used to identify and address patient deterioration [32,33], and has been previously identified as a potential screening tool for severe patient outcomes [50][51][52]. However, a UK multicentre study identified poor to moderate discrimination of medium-term COVID-19 outcomes from NEWS2 scores and age alone, calling into question its use as a screening tool [53]. A common observation with COVID-19 is of mild phenotypes deteriorating towards severe phenotypes (resulting in an increased NEWS2 score) as a result of the respiratory distress caused by COVID-19 pneumonitis. In comparison, the NEWS2 score given to a patient on admission shows no significant association with death, suggesting that it is unlikely to represent a significant predictive factor for COVID-19 severity. A weak association is seen between admission NEWS2 and both ICU admission (A = 0.24) and intubation (A = 0.24), but given that the NEWS2 score is often used to determine whether a patient has deteriorated sufficiently to require intubation or ICU treatment, this is perhaps unsurprising. Length of stay also showed a moderate association with ICU admission (A = 0.30). Several risk factors become apparent with an increased length of hospital stay, for instance the likelihood of the patient being on prolonged prescription of several non-routine medications.
These include medication for prevention of venous thromboembolism (heparin and other anticoagulants), medications to aid somnolence at night (sleep medication is frequently requested by the elderly while at hospital due to unfamiliar disturbing noises at night in a busy clinical environment), antibiotics, anti-anxiety medication, medication to help bowel movements (due to prolonged bed-rest and immobility), medication to offload water retention from immobility (again from prolonged bed-rest) and pain medication.
Other factors showing moderate association with ICU admission included the location category of their treatment ward (A = 0.66), the specific ward number (A = 0.49), and the ward in which their COVID-19 test swab was collected (A = 0.43). Patients at PHU are triaged and risk-stratified on admission, and the clinical setting to which they are initially taken for treatment reflects the clinical need for specialist services, equipment or staff-training levels distributed within a particular sector of the hospital. Such a sector is typically populated with a high number of patients needing high-dependency care and treatment. Since aerosolization of the virus is a proven risk, along with the potential for direct transmission from person to person, nosocomial spread within such high-dependency care units results in increased cases within these areas. It is therefore likely that associations of outcomes with location-related data are a result of localized outbreaks, resulting in cases with shared mutation patterns between patients who share similar treatment and comorbidity characteristics. Indeed, nationally over 15% of all cases have been estimated as having been hospital acquired in the first wave in the UK [54], with up to 20% of infections in inpatients and 73% in healthcare workers (HCW) due to nosocomial transmission [55]. It has been suggested that up to 80% of nosocomial infections were caused by only 20% of patients due to "super-spreader" events [14], with such rapid outbreak dynamics having been previously characterised in at least one outbreak at PHU [56].
One key question to address as new variants of SARS-CoV-2 continue to arise is their effect on the severity of the disease. Whilst the data described here do not span the emergence of variants such as Delta and Omicron, they do represent the emergence and subsequent rapid expansion of the first VOC, Alpha (B.1.1.7). Increased prevalence of Alpha in the local region led to increased transmission of a range of currently circulating variants within the hospital [56]. Interestingly, the rate of severe cases amongst Alpha cases (36.9%) was actually slightly lower than amongst non-Alpha cases (40.2%), suggesting that Alpha cases may present a lower risk of severe outcomes in our dataset compared to other variants (Table 1). However, whilst lineage was weakly associated with ICU admission (A = 0.18), we otherwise saw no statistically significant links between lineage and death, intubation, or case severity in general (Table 2). In addition, we identified changes to the odds of severe outcomes for cases of the Alpha VOC compared to other circulating variants for certain sub-populations. In particular, whilst the risk of severe outcomes was significantly higher amongst males compared to females in general (OR = 1.81), which is consistent with previous studies [57][58][59][60][61], the overall risk showed a moderate (although not significant) reduction in cases of the Alpha variant when compared to other cases for females (OR = 0.65) but not males (OR = 1.17). Looking specifically at our three severity indicators, we identified a mild non-significant decrease in risk of mortality amongst females, but in contrast a significant increase in risk in males for admission to ICU (OR = 2.03) and intubation (OR = 2.32).
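To make the odds ratios quoted above concrete, the following is a minimal sketch of how an OR and an approximate 95% confidence interval can be derived from a 2x2 table. It assumes a Wald interval on the log scale, which may differ from the method used for the values reported here; the function name and the counts are hypothetical and purely illustrative, not taken from the study data.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval for a 2x2 table.

    a: exposed, severe outcome      b: exposed, non-severe outcome
    c: unexposed, severe outcome    d: unexposed, non-severe outcome
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only (NOT the study data):
# e.g. exposed = male, unexposed = female; outcome = severe vs non-severe
or_, lower, upper = odds_ratio_ci(60, 40, 45, 55)

# An interval that excludes 1 is conventionally read as a significant effect
significant = lower > 1.0 or upper < 1.0
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}, significant: {significant}")
```

An OR above 1 indicates higher odds of the outcome in the exposed group, and an OR below 1 lower odds; an interval spanning 1 corresponds to the "not significant" cases described in the text.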
Overall, these results suggest that whilst the Alpha variant had no significant impact on COVID-19 severity overall, specific subgroups of the population may be more or less impacted by specific variants of the COVID-19 virus over others. Differences in the impact of SARS-CoV-2 infections between males and females have been suggested to result from differences in the expression of angiotensin converting enzyme (ACE2) receptors [62]. Indeed, circulating ACE2 levels have been shown to be higher in men, as well as in those with diabetes and pre-existing cardiovascular conditions [63]. The study of Stirrup et al [26], a large-scale multi-centre study in the UK, also found that overall hazard of mortality and ICU admission were not significantly affected in cases of Alpha compared to other lineages, but that sex-specific effects may be present. Interestingly, however, they showed that it was women specifically that showed increased risk of mortality and ICU admission in their cohort. Increased mortality appeared to be specific to those 70 years and above, with a slight decrease seen in 50-69 year olds. One possible explanation for this discrepancy may therefore be differences in the age profiles of those included in the two studies. Another possible explanation may be that our dataset contains cases from across the entire course of the pandemic, including the first UK wave where risk of severe outcomes was higher as a result of a lack of identified treatment and vaccine options. Indeed, a recent large-scale study of 30 million people in the UK showed that risk of severe COVID-19 outcomes is reduced as a result of ongoing vaccine programs [31]. However, our result remains when focussing only on cases from September 2020 onwards, indicating that wave 1 patients do not affect the outcome data.
It is also worth noting that whilst the Alpha variant data are largely homogeneous, significant heterogeneity exists in the non-Alpha data, with cases coming from 46 distinct lineages in these data. Another difference may be with respect to the population under consideration, since Stirrup et al was a multi-centre study, primarily from hospitals within London (although it did include data from the nearby city of Southampton). Both studies, however, point towards the role of Alpha in disease severity being context specific and mild overall.
Linkage of WGS and clinical data represents a powerful approach for assessment of the effects of Alpha on severity, in comparison to studies which used surrogate measures such as S-gene target failure (SGTF) in qPCR tests to differentiate Alpha from other lineages. Indeed, other studies based on community testing and SGTF have shown conflicting results, with studies showing increased risks associated with Alpha, but no difference between male and female cases in the effects of Alpha on mortality [64,65] or ICU admission [65]. Thus, the evidence for increased severity of the Alpha variant of concern remains inconclusive [66]. Beyond the role of VOCs in determining disease severity, we sought to identify potential mutations or mutation clusters associated with patients who suffered severe outcomes. Whilst we found no significant link between lineage and overall severity, we did find a weak link between the mutation type and the NEWS2 scores given to the patient at admission (A = 0.25) and the maximum score assigned (A = 0.23) (Table 3). Whilst this may indicate that there are mutations associated with patient health and physical derangement, it is also possible that such links relate to nosocomial transmission of the disease amongst clinically vulnerable patients, as previously discussed. This is further suggested given that the association is mostly enriched for mutations associated with non-severe outcomes.
To explore this in more detail, we utilised a range of machine learning models with individual mutations encoded alongside other patient factors, to further explore associations with patient mortality. Deep learning models have previously been developed for use in the diagnosis and screening of COVID-19 through interrogation of CT and chest X-ray images [67]. The two models with highest accuracy, XGBoost and MLP-ANN, were compared to identify features most linked with mortality. Renal disease, heart disease and diabetes feature for both models, with risk factors such as history of hypertension, COPD, prior history of malignancy, and asthma prominent in the XGBoost model, but not in the MLP-ANN model. The stochastic nature of algorithms such as XGBoost and MLP-ANN means that a degree of randomness exists in their results, in contrast to deterministic algorithms such as linear regression or logistic regression-based models. Regardless, it is clear that comorbidities are amongst the features most closely associated with disease severity. Whilst the XGBoost model identified comorbidity status and sex as being most predictive of severity (Fig 5D), the MLP-ANN identified a number of deletions as being the features with the most impact (Fig 5E). These deletions were all specific to the Alpha variant B.1.1.7, including the Δ69-70 deletion on the Spike protein responsible for SGTF in qPCR testing for Alpha [68][69][70][71].
These deletions are therefore likely identified by the model as surrogates for Alpha vs non-Alpha cases. Whilst this may indicate that Alpha may be associated with mortality, this is not borne out when looking at male and female cases individually (Fig 4). This is therefore likely the result of non-Alpha lineages primarily representing cases from earlier in the pandemic, but may also be linked to selection bias due to Alpha being over-represented in these data. Similarly, the well-documented D614G mutation, which was introduced at low levels during the first wave of infections in the UK but became dominant and fixed in the population in subsequent waves [45], was identified by both models. This mutation is also linked with the temporal nature of the pandemic, with severity often being worse in earlier waves due to the lack of treatment options, reduced testing and interventions, and the lack of a vaccine programme. It is therefore likely that these mutations are highlighting differences between cases early and later in the pandemic, rather than inherently having a functional role in increasing disease severity.
Overall, our analysis indicates that there are no clear strong factors that determine severe outcomes from COVID-19 (mortality, ICU admission or intubation). Whilst we detected a number of significant associations, most were mild and could be explained by confounding with either general patient health, their location within the hospital, or changes in our treatment capabilities for the disease throughout the pandemic. It has been previously shown that comorbidities such as cancer, renal disease and heart disease are linked to negative outcomes, particularly mortality [49]. Also, whilst it is interesting to note that the NEWS2 score showed significant association with disease outcomes, it is not suitable for prediction of outcomes, as discussed above. Similarly, the characteristics of the viral variant at the root of the infection are unlikely to present a suitable predictive tool for determining disease outcomes. Whilst there was some evidence of effects on severity from the Alpha variant compared to other circulating variants, the effect was inconsistent, with both increases and decreases in severity seen, sometimes at odds with previous studies.
Whilst this study focuses on only the Alpha variant, and thus cannot draw conclusions for further VOCs such as Delta and Omicron, these results suggest that within these data the introduction of the Alpha variant did not have a significant impact on severity of the disease. Of course, these data represent only a limited population, with 929 patient samples from a single hospital site. One other key limitation of this study is that the demographics of the patient cohort are skewed towards those of the local area, in particular with over 75% of those in the study being of a white background (Table 1). These results may therefore not be generalisable to the population as a whole. However, despite these limitations, our study represents a useful and in-depth interim exploration of the effects on disease severity in response to both clinical measures and viral genomics. Recently, a large-scale analysis of over 1 million patients in England showed lower or similar risks of death, hospital admission and hospital attendance between the BA.1 and BA.2 Omicron variants [30], matching our observation that emerging SARS-CoV-2 variants do not result in more severe outcomes for patients.
Since our data indicate that virus genomics have limited impact on disease severity, it is likely that understanding of those most susceptible to severe outcomes when infected by SARS-CoV-2 (beyond clinically vulnerable individuals) will come from studies such as the GenOMICC study in the UK (https://genomicc.org/about/), which aim to understand the interaction between virus and host, and explore genetic factors in humans that dictate disease outcomes. Indeed, multiple studies have already been conducted identifying potential susceptibility loci in the human genome that may put patients at increased risk of death or other severe outcomes, including mutations in genes linked to immune response, blood clotting and mucus production [72][73][74][75]. In particular, a recent study using machine learning approaches such as XGBoost identified variants from whole exome sequencing associated with severe COVID-19 [76]. These data identified associations between age, gender, and 16 variants linked to immune system and inflammatory processes able to predict severe outcomes with high accuracy. Such studies will help to further understand the factors that predispose individuals to severe outcomes from SARS-CoV-2 infection.
As society accustoms itself to a "new normal" way of life, we are learning to live with endemic COVID-19. New variants will continue to emerge, and it is therefore imperative that we learn what we can from existing data. It is particularly important for us to understand how the most severe disease cases arise, in the hope that we may target such cases specifically and early. Studies like this, which combine clinical and laboratory data, will thus be essential to that task.

Conclusion
Whilst many risk factors for severe COVID-19 have been identified, the precise mechanisms resulting in severe outcomes for those infected by SARS-CoV-2 (including admission to ICU, the need for mechanical ventilation, and mortality) remain poorly understood. In this study, we aimed to combine genomic sequencing data of SARS-CoV-2 viral variants with an extensive database of patient records to further understand those factors most associated with severe outcomes. In particular, we were interested to understand the precise role played by mutations in the virus itself, and whether infection with certain variants or viruses with specific mutations might be more likely to cause severe disease. Whilst patient outcome was weakly associated with factors linked with acute physiological status and human genetics, including age, sex and pre-existing conditions, our data suggest that severity risk is not significantly impacted by specific mutations in SARS-CoV-2. It is therefore likely that risk of severe outcomes results from a combination of patient health and innate genetic predisposition. Thus, whilst studies such as ours significantly further our understanding of the pathophysiology of the virus, ongoing studies exploring the role of host genetics on disease progression will continue to disentangle the complex factors that might increase risk to those infected with SARS-CoV-2.

The sequential method in Tensorflow v2.8 was used, incorporating the Adam optimization algorithm for stochastic gradient descent for training of deep learning models. Parameters used were a learning rate of 0.0001, with beta_1 = 0.9 and beta_2 = 0.799. Following initial stages of 10,000 epochs, the model was refined and optimised for the appropriate number of nodes and hidden layers, and an "early stopping" protocol was incorporated to stop training once the model performance stopped improving.
This was determined using a concurrent evaluation of cross-validation loss remaining similar over 20 epochs, which minimised over-fitting and reduced computing time. The two graphs here show close convergence and agreement between the train and validation sets of the MLP-ANN model. (b) The final architecture of the MLP-ANN model. The model contained 3 hidden layers (with 700, 700 and 10 nodes respectively), and a final output layer containing two nodes to represent the categorical outcomes of 0 (no-death) and 1 (death). The number of nodes was optimised over several runs of model building and hyperparameter optimisation steps. The final layer used "softmax" as the activation step, which scales numbers/logits into probabilities. The activation steps for the hidden layers were ReLU, used specifically to address the problem of vanishing gradients in deep-learning models. Dropout regularization was employed to reduce overfitting of the model, where different sets of neurons are dropped from the architecture, giving an overall result akin to training and optimizing multiple neural networks simultaneously.
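The stopping rule and activation functions described above can be illustrated with a small, dependency-free sketch. This is not the authors' TensorFlow code: it assumes that "remaining similar over 20 epochs" corresponds to a patience-style rule (comparable in spirit to the `EarlyStopping` callback in Keras), and the function names, the `min_delta` parameter and the loss values are hypothetical.

```python
import math


def relu(values):
    """ReLU activation: negative inputs become 0, positive inputs pass through."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Scale raw output-layer values (logits) into probabilities summing to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_stopping_epoch(val_losses, patience=20, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value by more than min_delta for `patience` consecutive epochs.
    patience=20 mirrors the 20-epoch window in the text (an assumption).
    """
    best = math.inf
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation-loss curve: improves for 30 epochs, then plateaus
losses = [1.0 / e for e in range(1, 31)] + [0.05] * 40
stop_epoch = early_stopping_epoch(losses)

# Two output nodes: index 0 = no-death, index 1 = death (illustrative logits)
probabilities = softmax([2.0, 0.0])
hidden = relu([-1.5, 0.0, 3.2])
print(stop_epoch, probabilities, hidden)
```

With a rule of this kind, the nominal upper limit on epochs only acts as a ceiling, since training halts as soon as the validation loss plateaus.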